Skip to content

Conversation

@claudevdm
Copy link

@claudevdm claudevdm commented Nov 11, 2025

This adds a new CloudPickleConfig argument that is meant to override the default behaviors for users with specific needs like deterministic hash (at the cost of performance regression).

The default config maintains behavior prior to this PR.

The first CloudPickleConfig option is to pass an id_generator that is used to create ID's for dynamic classes and typevars (to maintain isinstance semantics).

If the id_generator is omitted then new types will always be defined and isinstance semantics is lost.

Also included is a hash_dynamic_classdef function that can be passed via CloudPickleConfig.id_generator that pickles the classdef as mentioned in #568 (comment)

The unit test setup is refactored a bit to allow passing configs between processes.

This PR is related to #510 and #453

@claudevdm
Copy link
Author

FYI @tvalentyn

@claudevdm
Copy link
Author

claudevdm commented Nov 11, 2025

@tomMoral @ogrisel
can you please take a look at this proposal?

@ogrisel
Copy link
Contributor

ogrisel commented Nov 13, 2025

I am a bit busy these days. Maybe @tomMoral has the time for a first pass and I could then do a second pass later?

@tvalentyn
Copy link
Contributor

tvalentyn commented Nov 19, 2025

Thanks a lot @claudevdm and @ogrisel ! Very excited to see that we are making steps towards making pickling libraries configurable, as well as efforts to establish best-effort deterministic pickling. I know of several other users who would benefit from these changes.

Hi @tomMoral - would you have time to take a look? Totally understand that it might take some time to find bandwidth - just checking that the notification didn't get lost in your Github inbox. Thanks!

@lesteve
Copy link
Contributor

lesteve commented Nov 20, 2025

Please @claudevdm @tvalentyn the more often you ping, the less likely I am to want to make time to look at it.

This is a +600/-300 PR, this will take time and effort to review.

I don't think it is likely to be at the top of our priority list right now. You may want to try your luck again in a month or so 🙏.

Also it would help if you spend a bit of time to explain your use case and how it would make your life easier.

Copy link
Contributor

@tomMoral tomMoral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello

thanks a lot for the PR. We have been discussing this for a long time to try to find a way to make it determinist by default without overhead but I agree it makes sense to have this two stage process and get an option to change the behavior.

The PR is super large because of the introduction of the config class and the test infrastructure that gets more complex.
I am in favor of simplifying the PR, by introducing one extra argument instead of the config object (determinist_class_id). this will make the PR more consistent with the style of the library with all arguments explicit. I don't see a usecase for id_generator=None so I would say this is not a loss to only have a boolean argument, but happy to revise my opinion if there is a case.

For the tests, removing this will also require less changes as we will be able to keep class tests as is. This will make the PR much easier to review and to integrate.
Also trying to simplify further and avoid non-necessary changes will help.

Once again, thnaks a lot for the PR, it ahs been a subject on my todo list for a long time :)



def _get_or_create_tracker_id(class_def):
def _get_or_create_tracker_id(class_def, id_generator):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment to explain the logic here, and what is expected in id_generator?



def _decompose_typevar(obj):
def _decompose_typevar(obj, config: CloudPickleConfig):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no typing in the library, I am not in favor of introducing just a few type hints, in particular in private functions. So I would remove (but not a strong opinion).

Default: uuid_generator (generates UUID hex strings).
"""

id_generator: typing.Optional[callable] = uuid_generator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it is too complicated. I don't think anyone will want to use a custom class id.
I would use a determinist_class_id : bool = False, with a warning on the fact that putting True might make this a little slower.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this can be improved.

In Apache Beam we do require pickling types without any tracking (a fallback deterministic coder for shuffle keys). Apache Beam already pickles without dynamic class IDs, so removing that option would break update compatibility for long-running streaming pipelines. It would also make pickling of these types more expensive.

I propose a middle ground where we define an Enum here

class DynamicIdMode(Enum):
    UUID = "uuid"
    HASH = "hash"
    NONE = "none"  # internal use only, or explicitly documented as "no tracking"

Then the internal id generating function for the hash case becomes

def hash_dynamic_classdef(classdef):
    return hashlib.sha256(
        dumps(classdef, config=CloudPickleConfig(dynamic_class_id=DynamicIdMode.NONE))
    ).hexdigest()

obj.__covariant__,
obj.__contravariant__,
_get_or_create_tracker_id(obj),
_get_or_create_tracker_id(obj, config.id_generator),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather pass the entire config so it is easy to know from where this comes in the function using the argument.

file,
protocol=None,
buffer_callback=None,
config: CloudPickleConfig = DEFAULT_CONFIG,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not use a config object here.
All arguments are explicit so I would not change this only for one argument.
This makes argument discovery harder as you need to investigate the config object.

Comment on lines +55 to +57
assert (
depickled_type is original_type
), f"Expected {depickled_type} to be of type {original_type}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick

Suggested change
assert (
depickled_type is original_type
), f"Expected {depickled_type} to be of type {original_type}"
assert depickled_type is original_type, (
f"Expected {depickled_type} to be the same type as {original_type}"
)

Comment on lines +2023 to +2024
if not self.should_maintain_isinstance_semantics():
pytest.skip("Test irrelevant due to breaking isinstance semantics")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go after the docstring

Comment on lines +3174 to +3178
assert_isinstance_semantics(
config_id=config_id,
original_type=SampleDataclass,
depickled=cloned_type
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are not needed if we don't test for id_generator = None, for which I don't see a usecase.

msg="g now has closure cells even though f does not",
)

def assert_isinstance_semantics(self, original_type, depickled):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of duplicate from testutils.py. Having only one, or defining one by calling the other would help.

Comment on lines +3194 to +3195
class NoIdGeneratorPickleTest(CloudPickleTest):
config_id = "no_id_generator"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is super interesting to have the option for id_generator = None.

@claudevdm
Copy link
Author

Hello

thanks a lot for the PR. We have been discussing this for a long time to try to find a way to make it determinist by default without overhead but I agree it makes sense to have this two stage process and get an option to change the behavior.

The PR is super large because of the introduction of the config class and the test infrastructure that gets more complex. I am in favor of simplifying the PR, by introducing one extra argument instead of the config object (determinist_class_id). this will make the PR more consistent with the style of the library with all arguments explicit. I don't see a usecase for id_generator=None so I would say this is not a loss to only have a boolean argument, but happy to revise my opinion if there is a case.

For the tests, removing this will also require less changes as we will be able to keep class tests as is. This will make the PR much easier to review and to integrate. Also trying to simplify further and avoid non-necessary changes will help.

Once again, thnaks a lot for the PR, it ahs been a subject on my todo list for a long time :)

Thanks for the feedback.

I originally chose the config approach to separate standard pickle arguments (file, protocol, buffer_callback) from cloudpickle-specific behavior that most users might not care about.

In Apache Beam's fork we've added several cloudpickle-specific settings beyond id_generator (code object parameter handling, dynamic type reconstruction behavior) with the goal to upstream the ones that are valuable to others.

If more options are added the config approach makes it simpler when multiple independent options need to flow through the same function chain.

I agree wiring config through the test infrastructure is ugly and painful so I'd be happy to refactor the change to use flat arguments. I don't have a strong preference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants